An Embedded Bayesian Network Hidden Markov Model for Digital Forensics
Abstract
In this paper we combine a Bayesian Network model, which encodes forensic evidence during a given time interval, with a Hidden Markov Model that tracks and predicts the degree of criminal activity as it evolves over time; we call the combined model EBN-HMM. The model is evaluated with 500 randomly produced digital forensic scenarios and two specific forensic cases. The experimental results indicate that the model fits well with expert classification of forensic data. These initial results point to the potential of such Dynamical Bayesian Network methods for the analysis of digital forensic data.

1 Forensics Evidence and Its Temporal Metadata Structure

Digital forensic evidence corresponds to the dataset used to decide whether a crime has been committed, and it can provide a link between a crime and its victim or between a crime and its perpetrator [1]. The evidence can be sourced from storage devices (disks, discs, etc.), networks (e.g., packet data, routing tables, logs), embedded digital systems (mobile phones, PDAs), telecommunications traffic, and so on. Evidence includes not only the content of a forensic entity but also the meta-data associated with the entity, such as the document tracking history, timestamps, users, authors, etc.

Of particular interest to digital forensic investigations are conditions where the evidence is available in the form of forensic entities (e.g., an MS Word document, an email message, a browsed web page, etc.) that are generally time-stamped (e.g., time-stamps for document files, cached web pages, etc.) and attributed (e.g., the name of the document author). Some of these entities may be related to one or more other forensic entities (e.g., the topic of the contents of a web page and of an email may be the same). In such cases forensic evidence is rich in content, meta-data and relations, and the evidence instances are linked together in a complex temporal directed multi-graph.
Many research groups are working on inducing graphs from text documents, some of them in the temporal domain [2], but few induce the graphs from a combination of text data (sourced from documents, emails, etc.) and the rich set of meta-data (time-stamps, senders/receivers, etc.) that we obtain in digital forensics. To accomplish this, we need to develop a taxonomy that encompasses both the text data and the meta-data in a temporal context.

460 O. De Vel et al.

The major objective of digital forensic investigations is to extract unusual and interesting events and their causal relationships. Our paper focuses on the analysis of the metadata, the extraction of forensic evidence along a time sequence or timeline, the construction of a Bayesian network to model interactions between different items of forensic evidence, and the application of the proposed EBN-HMM model (Embedded Bayesian Network Hidden Markov Model) for summarizing, learning, inferring and interpreting possible criminal activities. The fundamental tasks are to estimate typical digital crime scenario models from data and to infer the most likely criminal acts given current observations and past criminal activities.

2 Evidence Extraction Along a Time Sequence and Classification

The example forensic case we deal with in this paper is a synthetic industrial IP leakage scenario. There are two principal suspects in this case: “Tricia”, a manager, and “Arthur”, a system administrator and software engineer. There are three computers (two PCs, one each belonging to Tricia and Arthur, and one server named “womboyn”) in two different locations. Multiple parallel threads of activity occur in this case, including a) an unanticipated company restructure (with a document being emailed anonymously prior to the fact), b) non-work-related activities, and c) submarine code (the IP) being leaked to a rival company.
The end-game is to determine which party is innocent and which is criminal, and then to identify the sequence of events that leads to the perpetrator of the IP leakage. This is summarized as a graph of observation objects (with time attributes) from which one has to induce the causal graph of events/activities.

The forensic data come from three sources. The first consists of email messages, including all of the suspects’ emails (sent, received, deleted, draft, etc.) during a specific period. The second source is the web browsing dataset, with attributes such as directories, file type, file name, timestamps, etc., which are extracted from the internet cache folder. The third dataset is related to file systems, event logs, document files, etc. Forensic data and their relations are quite complex, even when the data are encountered on a single computer host system, where they could be sourced from the web page cache, emails, event logs, file systems, chat room records and registries. For example, web pages have URL addresses, images, topic(s), time-stamps, etc.

We summarize and simplify the forensic actions as eight nodes and their interactions, as shown in Figure 2. “User” is the first node (labelled N1). The “LogServer” node (N2) records who logged into the server, and the “Browseweb” node (N3) describes when and what web pages are browsed. The “ReadEmail” (N4) and “SendEmail” (N8) nodes correspond to when, by whom and on what topics emails are transmitted. The “DownloadSoftware” (N5), “InstallApp” (N6) and “ExecuteApp” (N7) nodes indicate when and which programs are downloaded, installed and executed. Following these eight nodes of forensic actions, all of the forensic data extracted and stored were summarised in the temporal domain. The temporal domain is partitioned into ten convenient time intervals (T1 to T10), where each time interval corresponds to one or more days of case activities.
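The partitioning of time-stamped evidence into intervals can be sketched as follows. This is a minimal illustration in Python, not the authors' tooling: the interval dates are those listed in Table 1, while the function names (`interval_for`, `count_actions`) and the event records are hypothetical.

```python
from collections import Counter
from datetime import date

# Inclusive interval boundaries, taken from the dates listed in Table 1.
INTERVALS = [
    ("T1", date(2003, 1, 8), date(2003, 1, 10)),
    ("T2", date(2003, 1, 11), date(2003, 1, 11)),
    ("T3", date(2003, 1, 14), date(2003, 1, 14)),
    ("T4", date(2003, 1, 15), date(2003, 1, 15)),
    ("T5", date(2003, 1, 16), date(2003, 1, 16)),
    ("T6", date(2003, 1, 17), date(2003, 1, 17)),
    ("T7", date(2003, 1, 18), date(2003, 1, 20)),
    ("T8", date(2003, 1, 21), date(2003, 1, 21)),
    ("T9", date(2003, 1, 22), date(2003, 1, 22)),
    ("T10", date(2003, 1, 23), date(2003, 1, 23)),
]


def interval_for(day):
    """Return the interval label containing `day`, or None if no interval covers it."""
    for label, start, end in INTERVALS:
        if start <= day <= end:
            return label
    return None


def count_actions(events):
    """Count events per (interval, node) pair, skipping events outside all intervals."""
    counts = Counter()
    for day, node in events:
        label = interval_for(day)
        if label is not None:
            counts[(label, node)] += 1
    return counts


# Hypothetical event records (date, action node), for illustration only.
events = [
    (date(2003, 1, 9), "SendEmail"),
    (date(2003, 1, 9), "SendEmail"),
    (date(2003, 1, 19), "Browseweb"),
    (date(2003, 1, 12), "ReadEmail"),  # not covered by any interval in Table 1
]
counts = count_actions(events)
```

Keeping the boundaries in a single table makes it easy to change the partition without touching the counting logic.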
Table 1 is a sample analysis of the email messages of one suspect (Tricia), showing the activity in her sent emails.

Table 1. An example of a suspect’s (Tricia) email outbox analysis over ten time intervals

Time | Date | Quantity | Event brief descriptions
T1 | 8-10/01/2003 | 5 | Contacts Marvin in the rival company.
T2 | 11/01/2003 | 1 | Tries to have a dinner with Marvin and other staff of the rival company.
T3 | 14/01/2003 | 0 |
T4 | 15/01/2003 | 3 | Work-related activities.
T5 | 16/01/2003 | 2 | Tricia is thinking about Marvin’s suggestion and would like to continue the conversation.
T6 | 17/01/2003 | 3 | Pushes Arthur to finish the report; a tense relation exists with Arthur.
T7 | 18-20/01/2003 | 0 |
T8 | 21/01/2003 | 0 |
T9 | 22/01/2003 | 2 | Emails Marvin saying “Got it” (got what?). Once again pushes Arthur to finish the report, which again shows the tense relation with her colleague Arthur.
T10 | 23/01/2003 | 1 | Work-related activities.

Actions are labelled with a suspiciousness probability according to a table of suspiciousness degrees varying from normal to abnormal (see Table 2). Using Table 2, we map the forensic events of suspects to suspiciousness probabilities. Table 3 shows this mapping for Tricia’s action set over the ten time intervals.

Table 2. Ten degrees of suspiciousness event probabilities (T(p), F(p): see text for details)

Suspiciousness Degree: T T T -T --T ---F ---F --F -F F
Probability: 0 0.1 0.25 0.35 0.5 0.6 0.7 0.8 0.9 1

Table 3. Suspiciousness event probabilities for a sample of Tricia’s actions (recorded entries listed in order; blank cells are not shown)

Nodes: N1 N2 N3 N4 N5 N6 N7 N8
T1: T T T T T--
T2: T T-- T T---
T3: T T T T
T4: T T F- T T
T5: F- T-- F- F F-
T6: T T
T7:
T8: T F--- F F
T9: T F- F F F
T10: T T-- T T

3 EBN-HMM Model for Forensic Evidence

The EBN-HMM model (Embedded Bayesian Network Hidden Markov Model) is examined here because it has the advantages of coping with incomplete observation data and of reducing uncertainty.
The EBN-HMM model applies one of the popular Dynamical Bayesian Networks, the Hidden Markov Model [3], to model temporal relations in the Bayesian Net (BNet) [4]. That is, the BNet defines the structure of the observations that occur in a given interval, and the HMM defines the evolution of the criminal activity over time by associating specific observation variable values with different degrees of criminal activity. The hidden nodes correspond to the “degree of criminality”, which is conditioned on both the present observation (at time t) and the state at the previous time (t-1), as shown in Fig. 1.

Fig. 1. EBN-HMM: Hidden Markov Model embedded Bayesian Network observation model

In this specific EBN-HMM model, observations are modelled by a Bayesian Net. We refer to this BNet as the Bayesian Forensic Evidence Network Model (BFENM); it is shown in Figure 2 and represents the users’ forensic evidence and relations. The Conditional Probability Tables (CPTs) were obtained empirically from observations. Fig. 2 also presents a sample of the probability distribution during a specific time interval.

Fig. 2. The proposed BFENM model for modelling forensic evidence

There are three observable nodes (shaded) and five hidden ones in the net. The probability of the LogServer node is 0.0 (0%), which means unauthorised logging on. A zero probability for the DownloadSoftware node represents downloading unauthorised software, files, etc. The same representation applies to the InstallApp node. After inference, the probabilities of the other five nodes are obtained. The probability of the “User” node is 0.174 (17.4%), which means that the user is more likely to be an unauthorised user.
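The temporal structure described in this section can be made concrete with the standard HMM factorization. The following is a sketch under that assumption, since the paper does not write the factorization out. The symbols are introduced here for illustration: S_t is the hidden degree of criminality in interval t, O_t is the observation for interval t as encoded by the BFENM, and T = 10 is the number of intervals.

```latex
% Joint distribution over hidden states and observations
P(S_{1:T}, O_{1:T}) = P(S_1)\,P(O_1 \mid S_1)\,\prod_{t=2}^{T} P(S_t \mid S_{t-1})\,P(O_t \mid S_t)

% Filtering recursion: belief about the current state given all observations so far
P(S_t \mid O_{1:t}) \;\propto\; P(O_t \mid S_t)\,\sum_{s'} P(S_t \mid S_{t-1} = s')\,P(S_{t-1} = s' \mid O_{1:t-1})
```

The recursion shows how the belief about the state at time t combines the present observation with the belief carried over from time t-1.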
Training and testing data were obtained from a large number of forensic cases produced by Monte Carlo sampling of an expert-provided observation BNet (BFENM). This produced 5,000 samples for training and testing, with the additional random selection of which nodes were observed in each sample so that we could simulate missing observations. At each interval, the data consisted of a vector of eight dimensions, giving 5,000 eight-dimensional vectors and therefore 500 scenarios over 10 time intervals. K-means vector quantization [5] was applied to cluster the vectors into discrete symbols, resulting in 16 clusters (symbols). The Baum-Welch [3] and supervised training algorithms were used for training the HMMs from datasets annotated (interpreted for degree of suspiciousness) over time.

4 A Specific Forensic Case Study

After the forensic evidence was extracted and labelled along the timeline on which the events happened, the observed and inferred probabilities were obtained in the temporal domain. Table 4 shows them for suspect Tricia. A table for Arthur was also computed.

Table 4. Suspect Tricia’s probabilities (in percentages) for observed and inferred actions

Time | N1 | N2 | N3 | N4 | N5 | N6 | N7 | N8
T1 | 100 | 88 | 100 | 100 | 89 | 100 | 93 | 0
T2 | 100 | 80 | 100 | 100 | 78 | 79 | 84 | 0
T3 | 100 | 86 | 100 | 100 | 85 | 89 | 100 | 80
T4 | 100 | 85 | 100 | 0 | 83 | 86 | 100 | 100
T5 | 0 | 26 | 0 | 0 | 0 | 10 | 5 | 0
T6 | 100 | 80 | 86 | 83 | 73 | 77 | 77 | 100
T7 | 50 | 60 | 40 | 37 | 45 | 47 | 40 | 35
T8 | 100 | 53 | 20 | 0 | 0 | 35 | 0 | 24
T9 | 15 | 28 | 100 | 0 | 28 | 0 | 0 | 0
T10 | 100 | 80 | 0 | 100 | 40 | 60 | 62 | 100

We then applied the complete EBN-HMM model to the Tricia and Arthur datasets and, in particular, used the Viterbi (Dynamic Programming) algorithm [6] to determine the most likely type of suspicious activity associated with the observations. Important decision-based results are shown in Figures 3a and 3b for Tricia and Arthur. Here, the grey bars correspond to high degrees of suspicion, with Tricia having four grey bars and Arthur only one.
Arthur’s single grey bar occurred at T9, where in fact there were no suspicious observations; the grey bar there only represents a warning that the evidence had accumulated sufficiently to infer that state. One conclusion from this is that Tricia is possibly the criminal party and Arthur is more likely innocent (although he was engaged in non-work-related, but not criminal, activities). Fig. 3a is an example of a sequence of events that could lead to identifying the perpetrator of the IP leakage.

Fig. 3. (a) Tricia’s evolution of criminal activity over time (T1-T10)
Fig. 3. (b) Arthur’s evolution of criminal activity over time (T1-T10)
In both panels the y-axis is the suspiciousness degree and the x-axis is the temporal domain, T1 to T10.

Upon further analysis of a graph of observation objects (with time attributes), we found that there were a few important events relating to Tricia’s activities during time intervals T5, T8 and T9 that may point to Tricia’s culpability. During T5, Tricia had been undertaking unauthorised log-ins to the womboyn server (perhaps she discovered the restructure document) and had browsed non-work-related websites. She had also been downloading, installing and executing unauthorised software (the anonymous emailer). During interval T8, she downloaded files from the company server (e.g., the restructure document), transferred the IP code from her PC onto removable media (floppy disk), and browsed non-work-related websites. During T9, Tricia emailed Marvin (in the rival company) “Got it” (presumably the IP, which she would eventually pass on to him) and proceeded to anonymously email the restructure document to all staff. These events would need to be investigated to confirm (or refute) her culpability (e.g., how did she get access to the IP?).
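The Viterbi decoding step used in the case study can be sketched as follows. This is a minimal, generic decoder in Python and not the authors' implementation. The two hidden states (0 for low suspicion, 1 for high suspicion), the three observation symbols and all probability values are hypothetical; the paper uses 16 symbols and does not report its HMM parameters.

```python
import math


def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a non-empty list of symbol indices.

    start_p[s]    : probability of starting in state s
    trans_p[i][j] : probability of moving from state i to state j
    emit_p[s][o]  : probability of emitting symbol o while in state s
    """
    n = len(start_p)

    def log(p):
        # Work in log space; a zero probability becomes minus infinity.
        return math.log(p) if p > 0 else float("-inf")

    # Best log-score of any path ending in each state at the first time step.
    delta = [log(start_p[s]) + log(emit_p[s][obs[0]]) for s in range(n)]
    backptrs = []
    for o in obs[1:]:
        prev, delta, ptr = delta, [], []
        for j in range(n):
            # Best predecessor of state j at this step.
            best_i = max(range(n), key=lambda i: prev[i] + log(trans_p[i][j]))
            ptr.append(best_i)
            delta.append(prev[best_i] + log(trans_p[best_i][j]) + log(emit_p[j][o]))
        backptrs.append(ptr)

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters, for illustration only.
start_p = [0.5, 0.5]
trans_p = [[0.7, 0.3], [0.3, 0.7]]
emit_p = [[0.8, 0.15, 0.05],   # state 0 mostly emits symbol 0
          [0.05, 0.15, 0.8]]   # state 1 mostly emits symbol 2
obs = [0, 0, 2, 2, 0]          # one symbol per time interval
path = viterbi(obs, start_p, trans_p, emit_p)
```

With these illustrative values the decoded path is `[0, 0, 1, 1, 0]`: the two intervals with symbol 2 are assigned to the high-suspicion state.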
Similar resources
Embedded Bayesian networks for face recognition
The embedded Bayesian networks (EBN) introduced in this paper, are a generalization of the embedded hidden Markov models previously used for face and character recognition. An EBN is defined recursively as a hierarchical structure where the “parent” node is a Bayesian network (BN) that conditions the EBNs or the observation sequence that describes the nodes of the “child” layer. With an EBN, on...
Hierarchical Modeling and Recognition of Manipulative Gesture
In order to achieve natural, proactive, and non-intrusive interaction between humans and robots, the understanding of human actions is a highly relevant task. In this paper, a vision-based method for manipulative gesture recognition is proposed. Different from the traditional trajectory-based approaches, the manipulative actions are modeled not only based on the hand trajectories but also on th...
Modeling the spread of infectious diseases based on the Bayesian statistical approach
Background and Aim: Health surveillance systems are now paying more attention to infectious diseases, largely because of emerging and re-emerging infections. The main objective of this research is presenting a statistical method for modeling infectious disease incidence based on the Bayesian approach. Material and Methods: Since infectious diseases have two phases, namely epidemic and non-epidem...
DBN versus HMM for Gesture Recognition in Human-Robot Interaction
We designed an easy-to-use user interface based on speech and gesture modalities for controlling an interactive robot. This paper, after a brief description of this interface and the platform on which it is implemented, describes an embedded gesture recognition system which is part of this multimodal interface. We describe two methods, namely Hidden Markov Models and Dynamic Bayesian Networks, a...
Learning Bayesian Network Structure using Markov Blanket in K2 Algorithm
A Bayesian network is a graphical model that represents a set of random variables and their causal relationship via a Directed Acyclic Graph (DAG). There are basically two methods used for learning Bayesian network: parameter-learning and structure-learning. One of the most effective structure-learning methods is K2 algorithm. Because the performance of the K2 algorithm depends on node...
Publication date: 2006